
String Covering: A Survey

Authors: Mhaskar Neerja; Smyth W. F.
Publication date: 21/11/2022

The study of strings is an important combinatorial field that predates the digital computer. Strings can be very long (trillions of letters), so it is important to find compact representations. Here we first survey various forms of one potential compaction methodology, the cover of a given string x, initially proposed in a simple form in 1990, but increasingly of interest as more sophisticated variants have been discovered. We then consider covering by a seed; that is, a cover of a superstring of x. We conclude with many proposals for research directions that could make significant contributions to string processing in the future.
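To make the basic notion concrete: a string u is a cover of x when the occurrences of u in x, possibly overlapping, account for every position of x. The following is a minimal Python sketch, not taken from the survey, that finds the shortest cover by testing each border of x (a cover must be both a prefix and a suffix of x). The function names `border_lengths`, `is_cover`, and `shortest_cover` are illustrative assumptions, and the quadratic worst case is far from the linear-time methods in the literature.

```python
def border_lengths(x):
    """Return the lengths of all proper borders of x, in ascending order."""
    n = len(x)
    if n == 0:
        return []
    beta = [0] * n  # beta[i] = length of the longest border of x[:i+1]
    for i in range(1, n):
        k = beta[i - 1]
        while k > 0 and x[i] != x[k]:
            k = beta[k - 1]
        if x[i] == x[k]:
            k += 1
        beta[i] = k
    lengths = []
    k = beta[-1]
    while k > 0:  # follow the border chain from longest to shortest
        lengths.append(k)
        k = beta[k - 1]
    return lengths[::-1]


def is_cover(u, x):
    """True if occurrences of u (overlaps allowed) cover every position of x."""
    if not u:
        return False
    covered = 0  # x[:covered] is covered so far
    start = x.find(u)
    while start != -1:
        if start > covered:  # a gap that no occurrence reaches
            return False
        covered = start + len(u)
        start = x.find(u, start + 1)
    return covered == len(x)


def shortest_cover(x):
    """Shortest cover of x; x itself when no proper border covers it."""
    for k in border_lengths(x):
        if is_cover(x[:k], x):
            return x[:k]
    return x
```

For example, `shortest_cover("aabaabaa")` yields `"aabaa"`, whose two occurrences overlap in the middle, while `"abcab"` has no proper cover and is returned unchanged.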

arXiv.org e-Print Archive

Episciences.org

Computation of the suffix array, Burrows-Wheeler transform and FM-index in V-order

Authors: Daykin Jacqueline; Mhaskar Neerja; Smyth W. F.
Publication date: 01/01/2021

V-order is a total order on strings that determines an instance of Unique Maximal Factorization Families (UMFFs), a generalization of Lyndon words. The fundamental V-comparison of strings can be done in linear time and constant space. V-order has been proposed as an alternative to lexicographic order (lexorder) in the computation of suffix arrays and in the suffix-sorting induced by the Burrows-Wheeler transform (BWT). In line with the recent interest in the connection between suffix arrays and Lyndon factorization, in this paper we obtain similar results for the V-order factorization. Indeed, we show that the results describing the connection between suffix arrays and Lyndon factorization are matched by analogous V-order processing. We also describe a methodology for efficiently computing the FM-Index in V-order, as well as V-order substring pattern matching using backward search.
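For readers unfamiliar with the structures named above, here is a minimal Python sketch of the suffix array, the BWT, and backward search in ordinary lexicographic order. It does not implement V-order, which the abstract does not define in enough detail to reproduce; it only shows the lexorder baseline that the paper generalizes. The function names and the `$` sentinel (assumed smaller than every letter of the text) are illustrative assumptions, and the naive sorting and counting are chosen for clarity rather than efficiency.

```python
def suffix_array(s):
    """Start positions of the suffixes of s in lexicographic order (naive sort)."""
    return sorted(range(len(s)), key=lambda i: s[i:])


def bwt_from_sa(s, sa):
    """BWT of s: the letter preceding each sorted suffix (wrapping at 0)."""
    return "".join(s[i - 1] for i in sa)


def backward_search(text, pattern):
    """Sorted 0-based start positions of pattern in text, via backward search."""
    s = text + "$"  # sentinel, assumed smaller than all letters of text
    sa = suffix_array(s)
    bwt = bwt_from_sa(s, sa)

    # C[c] = number of letters in s that are strictly smaller than c
    C, total = {}, 0
    for c in sorted(set(s)):
        C[c] = total
        total += s.count(c)

    def occ(c, k):
        """Number of occurrences of c in bwt[:k] (naive count)."""
        return bwt[:k].count(c)

    lo, hi = 0, len(s)  # current suffix array interval [lo, hi)
    for c in reversed(pattern):  # extend the match one letter to the left
        if c not in C:
            return []
        lo = C[c] + occ(c, lo)
        hi = C[c] + occ(c, hi)
        if lo >= hi:
            return []
    return sorted(sa[lo:hi])
```

A real FM-index replaces the naive `occ` with sampled rank structures so that each step takes constant or logarithmic time; the interval-narrowing logic stays the same.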

Aberystwyth Research Portal

Research Repository

Practical KMP/BM Style Pattern-Matching on Indeterminate Strings

Authors: Dehghani Hossein; Mhaskar Neerja; Smyth W. F.
Publication date: 02/05/2022

In this paper we describe two simple, fast, space-efficient algorithms for finding all matches of an indeterminate pattern $p = p[1..m]$ in an indeterminate string $x = x[1..n]$, where both $p$ and $x$ are defined on a "small" ordered alphabet $\Sigma$, say $\sigma = |\Sigma| \le 9$. Both algorithms depend on a preprocessing phase that replaces $\Sigma$ by an integer alphabet $\Sigma_I$ of size $\sigma_I = \sigma$ which (reversibly, in time linear in string length) maps both $x$ and $p$ into equivalent regular strings $y$ and $q$, respectively, on $\Sigma_I$, whose maximum (indeterminate) letter can be expressed in a 32-bit word (for $\sigma \le 4$, thus for DNA sequences, an 8-bit representation suffices). We first describe an efficient version KMP Indet of the venerable Knuth-Morris-Pratt algorithm to find all occurrences of $q$ in $y$ (that is, of $p$ in $x$), but, whenever necessary, using the prefix array, rather than the border array, to control shifts of the transformed pattern $q$ along the transformed string $y$. We go on to describe a similar efficient version BM Indet of the Boyer-Moore algorithm that turns out to execute significantly faster than KMP Indet over a wide range of test cases. A noteworthy feature is that both algorithms require very little additional space: $\Theta(m)$ words. We conjecture that a similar approach may yield practical and efficient indeterminate equivalents to other well-known pattern-matching algorithms, in particular the several variants of Boyer-Moore.
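The abstract does not spell out the integer mapping. One encoding consistent with the stated word sizes, offered here as an assumption and not as the authors' confirmed method, assigns each letter a distinct small prime and each indeterminate letter the product of its members' primes, so that two positions match exactly when their codes share a common factor. The Python sketch below pairs that encoding with a naive quadratic scan; it is neither KMP Indet nor BM Indet. The names `encode` and `naive_indet_match`, the representation of an indeterminate string as a list of letter groups, and the 0-based positions are all illustrative choices.

```python
from math import gcd, prod

# First nine primes: enough for an alphabet of size sigma <= 9 (assumed encoding).
PRIMES = [2, 3, 5, 7, 11, 13, 17, 19, 23]


def encode(s, alphabet):
    """Map each position (a group of letters) to the product of its letters' primes."""
    code = {a: PRIMES[i] for i, a in enumerate(sorted(alphabet))}
    return [prod(code[a] for a in set(pos)) for pos in s]


def naive_indet_match(p, x, alphabet):
    """0-based positions where indeterminate pattern p matches in x (brute force)."""
    q, y = encode(p, alphabet), encode(x, alphabet)
    m, n = len(q), len(y)
    return [
        i
        for i in range(n - m + 1)
        # two positions match when their letter sets intersect, i.e. gcd > 1
        if all(gcd(q[j], y[i + j]) > 1 for j in range(m))
    ]


# Usage: "ct" denotes a position that may be either c or t.
x = ["a", "ct", "g", "a", "c", "g"]
p = ["a", "c", "g"]
matches = naive_indet_match(p, x, "acgt")
```

Under this assumed encoding the largest possible code is the product of all nine primes, 223092870, which fits in 32 bits, and for four letters it is 210, which fits in 8 bits; both figures agree with the sizes quoted in the abstract.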

arXiv.org e-Print Archive